AITopics | vit-b 32

Model merging offers a promising avenue for knowledge integration and parallel development without retraining. Yet, existing methods either ignore the geometry of the loss landscape or rely on intractable full-space Hessian approximations. We propose EpiMer, a framework that casts model merging as solving the Fréchet mean on a Riemannian manifold and restricts the computation to a low-rank subspace spanned by the task vectors. With the expected Hessian as the metric, we reveal a connection between local curvature and epistemic uncertainty of the parameters. Our theoretical analysis decomposes the merging error bound into the subspace Fréchet variance and the residual energy, and provides a closed-form characterization of when curvature-aware merging provably outperforms flat-geometry methods. In addition, our framework unifies both curvature-aware methods and recent spectral methods as special cases of the subspace Fréchet mean with different geometric metrics. Merging fine-tuned CLIP-ViT models on eight image classification tasks, Epistemic Merging strictly outperforms the baselines on all three CLIP-ViT backbones at matched rank, improving the across-task average accuracy and worst-task accuracy on every backbone.
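
For orientation, the Fréchet mean mentioned in the abstract can be written out as below. The first expression is the standard definition. The other two are only an illustrative simplification: the locally constant metric H_i per task and the orthonormal basis U of the task-vector span are assumptions made here, not details confirmed by the abstract.

```latex
% Standard definition: Frechet mean of fine-tuned parameters \theta_1,\dots,\theta_K
% under the geodesic distance d_g induced by a Riemannian metric g.
\theta^\star = \arg\min_{\theta} \sum_{i=1}^{K} d_g(\theta, \theta_i)^2

% Assumption (illustrative): near each \theta_i the metric is a constant
% positive-definite matrix H_i (for example an expected Hessian), so that
% d_g(\theta,\theta_i)^2 \approx (\theta-\theta_i)^\top H_i (\theta-\theta_i).
\theta^\star \approx \Big(\sum_{i=1}^{K} H_i\Big)^{-1} \sum_{i=1}^{K} H_i\,\theta_i

% Assumption (illustrative): restrict to \theta = \theta_0 + U c, where the columns
% of U are an orthonormal basis of span\{\theta_i - \theta_0\} and
% c_i = U^\top(\theta_i - \theta_0).
c^\star = \Big(\sum_{i=1}^{K} U^\top H_i U\Big)^{-1} \sum_{i=1}^{K} U^\top H_i U\, c_i
```

Under these assumptions, setting every H_i to the identity reduces both closed forms to plain parameter averaging, which is the flat-geometry case the abstract contrasts against.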

artificial intelligence, epimer, machine learning, (19 more...)

arXiv.org Machine Learning

2605.26693

Genre: Research Report (0.64)

Technology:

Information Technology > Artificial Intelligence > Vision (0.88)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.68)

d28077e5ff52034cd35b4aa15320caea-Paper-Conference.pdf

Neural Information Processing Systems | Apr-29-2026, 21:05:52 GMT

artificial intelligence, machine learning, natural language, (18 more...)

Neural Information Processing Systems

Genre: Research Report > New Finding (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
Information Technology > Artificial Intelligence > Vision (0.93)

e40b60677880e7e74f8a081f65703f0d-Supplemental-Conference.pdf

Neural Information Processing Systems | Apr-28-2026, 04:37:55 GMT

artificial intelligence, caption, machine learning, (15 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence > Machine Learning (1.00)

dd59fad18638714e6c447a3b7b9c4160-Paper-Conference.pdf

Neural Information Processing Systems | Feb-18-2026, 09:38:45 GMT

large language model, machine learning, natural language, (22 more...)

Neural Information Processing Systems

Country:

Asia > China > Hong Kong (0.04)
North America > United States > Illinois > Champaign County > Urbana (0.04)

Genre:

Research Report > New Finding (1.00)
Research Report > Experimental Study > Negative Result (0.34)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.72)
Information Technology > Artificial Intelligence > Natural Language > Chatbot (0.49)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.32)

PyramidCLIP: Hierarchical Feature Alignment for Vision-language Model Pretraining

Neural Information Processing Systems | Feb-12-2026, 14:53:05 GMT

Zhuang, K. Li, H. Cheng, X. Guo, F. Huang, R. Ji, and X. Sun, "Disco: Remedy self-supervised learning on lightweight models with distilled contrastive learning," arXiv preprint arXiv:2104.09124, 2021.

artificial intelligence, machine learning, pyramidclip, (13 more...)

Neural Information Processing Systems

Country: Europe > Poland (0.04)

Technology: Information Technology > Artificial Intelligence > Machine Learning (1.00)

Appendix

Neural Information Processing Systems | Feb-9-2026, 01:46:56 GMT

For vision transformers, we train linear probes on representations from individual tokens or on the representation averaged over all tokens, at the output of different transformer layers (each layer meaning a full transformer block including self-attention and MLP). Moreover, ResNets differ from ViTs in that the number of channels changes throughout the model, with fewer channels in the earlier layers. We train a linear probe on each individual token and plot the average accuracy over the test set, in percent. Here we plot the results for each token at a subset of layers in 3 models: ViT-B/32 trained with a classification token (CLS) or global average pooling (GAP), as well as a ResNet50. There are two main observations to be made.

artificial intelligence, figurec, representation, (15 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence > Vision (0.36)

123a18dfd821c8b440f42a00a27648d6-Supplemental-Conference.pdf

Neural Information Processing Systems | Feb-8-2026, 01:25:22 GMT

category, dataset, scenario, (16 more...)

Neural Information Processing Systems

Country:

North America > United States (0.05)
Europe > Austria > Styria > Graz (0.05)

Technology:

Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.35)

Estimating the Effective Rank of Vision Transformers via Low-Rank Factorization

Zerihun, Liyu

arXiv.org Artificial Intelligence | Dec-2-2025

Deep networks are heavily over-parameterized, yet their learned representations often admit low-rank structure. We introduce a framework for estimating a model's intrinsic dimensionality by treating learned representations as projections onto a low-rank subspace of the model's full capacity. Our approach: train a full-rank teacher, factorize its weights at multiple ranks, and train each factorized student via distillation to measure performance as a function of rank. We define effective rank as a region, not a point: the smallest contiguous set of ranks for which the student reaches 85-95% of teacher accuracy. To stabilize estimates, we fit accuracy vs. rank with a monotone PCHIP interpolant and identify crossings of the normalized curve. We also define the effective knee as the rank maximizing perpendicular distance between the smoothed accuracy curve and its endpoint secant; an intrinsic indicator of where marginal gains concentrate. On ViT-B/32 fine-tuned on CIFAR-100 (one seed, due to compute constraints), factorizing linear blocks and training with distillation yields an effective-rank region of approximately [16, 34] and an effective knee at r* ~ 31. At rank 32, the student attains 69.46% top-1 accuracy vs. 73.35% for the teacher (~94.7% of baseline) while achieving substantial parameter compression. We provide a framework to estimate effective-rank regions and knees across architectures and datasets, offering a practical tool for characterizing the intrinsic dimensionality of deep models.
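
The region and knee definitions in this abstract can be illustrated with a small sketch. It is not the authors' code. It swaps the monotone PCHIP interpolant for plain piecewise-linear interpolation to stay dependency-free, evaluates the knee only at measured ranks, and rescales both axes to [0, 1] before measuring distance to the secant (the abstract does not specify axis scaling). The function names and the toy accuracy curve are hypothetical. Only the rank-32 figures (69.46% student, 73.35% teacher) come from the abstract.

```python
import math

def crossing_rank(ranks, norm_acc, level):
    """Smallest rank at which the piecewise-linear curve first reaches `level`.

    Returns None if the curve never reaches the level.
    """
    if norm_acc[0] >= level:
        return float(ranks[0])
    points = list(zip(ranks, norm_acc))
    for (r0, a0), (r1, a1) in zip(points, points[1:]):
        if a0 < level <= a1:
            # Linear interpolation between the two bracketing measurements.
            return r0 + (level - a0) * (r1 - r0) / (a1 - a0)
    return None

def knee_rank(ranks, norm_acc):
    """Measured rank farthest (perpendicularly) from the endpoint secant.

    Assumption: both axes are rescaled to [0, 1] first, so the secant runs
    from (0, 0) to (1, 1) and the distance is |y - x| / sqrt(2).
    """
    x_span = ranks[-1] - ranks[0]
    y_span = norm_acc[-1] - norm_acc[0]
    xs = [(r - ranks[0]) / x_span for r in ranks]
    ys = [(a - norm_acc[0]) / y_span for a in norm_acc]
    dists = [abs(y - x) / math.sqrt(2) for x, y in zip(xs, ys)]
    return ranks[dists.index(max(dists))]

# Toy curve (hypothetical values, NOT results from the paper):
# student accuracy divided by teacher accuracy at each rank.
ranks = [4, 8, 16, 32, 64]
norm_acc = [0.50, 0.70, 0.80, 0.90, 1.00]

low = crossing_rank(ranks, norm_acc, 0.85)   # lower edge of the 85-95% region
high = crossing_rank(ranks, norm_acc, 0.95)  # upper edge of the 85-95% region
knee = knee_rank(ranks, norm_acc)

# Figures quoted in the abstract: student vs. teacher top-1 at rank 32.
ratio_at_32 = 69.46 / 73.35

print(f"region ~ [{low:.1f}, {high:.1f}], knee at rank {knee}")
print(f"rank-32 student/teacher ratio: {ratio_at_32:.3f}")
```

On this toy curve the region comes out at roughly [24, 48] with the knee at rank 16. The last line reproduces the abstract's "~94.7% of baseline" figure from the two quoted accuracies.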

artificial intelligence, distillation, machine learning, (15 more...)

arXiv.org Artificial Intelligence

2512.00792

Genre: Research Report (0.84)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.32)

Advancing Compositional Awareness in CLIP with Efficient Fine-Tuning

Peleg, Amit, Singh, Naman Deep, Hein, Matthias

arXiv.org Artificial Intelligence | Oct-29-2025

Vision-language models like CLIP have demonstrated remarkable zero-shot capabilities in classification and retrieval. However, these models often struggle with compositional reasoning - the ability to understand the relationships between concepts. A recent benchmark, SugarCrepe++, reveals that previous works on improving compositionality have mainly improved lexical sensitivity but neglected semantic understanding. In addition, downstream retrieval performance often deteriorates, although one would expect that improving compositionality should enhance retrieval. In this work, we introduce CLIC (Compositionally-aware Learning in CLIP), a fine-tuning method based on a novel training technique combining multiple images and their associated captions. CLIC improves compositionality across architectures as well as differently pre-trained CLIP models, both in terms of lexical and semantic understanding, and achieves consistent gains in retrieval performance. This even applies to the recent CLIPS, which achieves SOTA retrieval performance. Nevertheless, the short fine-tuning with CLIC leads to an improvement in retrieval and to the best compositional CLIP model on SugarCrepe++. All our models and code are available at https://clic-compositional-clip.github.io

caption, large language model, machine learning, (22 more...)

arXiv.org Artificial Intelligence

2505.24424

Country: Europe > Germany (0.28)

Genre: Research Report (1.00)

Technology: